LDA starts from a fixed number of topics. Each topic is represented as a distribution over words, and each document is in turn represented as a distribution over topics. Although the topics themselves are unlabeled, the word distributions they define give a sense of the different ideas contained in the documents. Reference: https://medium.com/intuitionmachine/the-two-paths-from-natural-language-processing-to-artificial-intelligence-d5384ddbfc18
I'll start by reading the data in from the CSV that was previously processed by another notebook.
import pandas as pd
import os
abstracts_df = pd.read_csv(os.path.join('data', 'processed', 'abstracts.csv'))
# https://www.nsf.gov/awardsearch/showAward?AWD_ID=2053734&HistoricalAwards=false
abstracts_df.dropna(subset=['award_id', 'abstract'], inplace=True)
In NLP tasks we usually need to normalize the texts before processing them. My normalization process consists of: casting the input to a string, converting it to lowercase, tokenizing, removing stop words, applying a lemmatizer, and finally dropping tokens (including punctuation) with three or fewer characters.
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer  # lemmatizer from WordNet
# from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Stop words
stop_words = set(stopwords.words('english'))
def normalize_text(text):
    # Cast to string
    normalized_text = str(text)
    # Convert to lowercase
    normalized_text = normalized_text.lower()
    # Split the text into tokens and transform each word into its lemma
    # -- e.g., "criteria" -> "criterion"
    lemmatizer = WordNetLemmatizer()
    # stemmer = PorterStemmer()  # avoided because stemming cuts words short,
    # e.g., "geocoordinates" -> "geocoordin"
    word_tokens = word_tokenize(normalized_text)
    tokens = [lemmatizer.lemmatize(w) for w in word_tokens
              if w not in stop_words and len(w) > 3]
    normalized_text = " ".join(tokens)
    return normalized_text
text = "National efforts to digitize natural history collections have transformed previously siloed, unstandardized resources into a networked, openly available information nexus usable to meet grand scientific and societal challenges. Despite these enormous strides, major bottlenecks in this digitization process still exist, especially in areas where automation approaches have been most challenging. In particular, capturing analog specimen data into digital format and converting text descriptions of collecting locations into mappable geocoordinates, have remained boutique efforts. Because of these bottlenecks, as many as 91% of digitized specimens are missing key elements that hamper ability to use these specimen records more effectively. This project will develop key workflows to dramatically increase the speed at which specimen data can be captured and made available broadly to data providers and consumers. These workflows include novel approaches that use both computer and human intelligence to advance our ability to capture specimen information. One key workflow focuses on the challenge of automated conversion of imaged specimen labels into properly formatted and usable digital text. Critical to the success of this workflow are human validation checkpoints that will be implemented using a popular citizen science platform, Notes from Nature. A second workflow focuses on new tools that take advantage of previous efforts to assign mappable coordinates based on specimen collection location to automatically add such mapping information for specimens missing those data. Finally, this effort will create tools for easy access to these new data in and out of common use databases, making the data immediately available for museum providers and researchers alike. This effort will connect public participation in science to these novel tools and technologies. 
Further, it will train diverse graduate students and undergraduate students in bioinformatics and museum science.<br/><br/>This effort has three design goals that together will dramatically reduce the digitization gap in museum specimen data. The first design goal will combine machine learning methods with public participation in scientific research (PPSR) via the successful Notes from Nature (NfN) project to speed up label digitization and facilitate obtaining locality data. A key part of the first design goal utilizes supervised machine learning approaches and object character recognition (OCR) when possible but also includes “humans in the loop” using the NfN platform to gather fast quality feedback from human volunteers at key points. This approach also provides a means to create high-quality training datasets needed for improving automation steps, ultimately further reducing human effort. The second design goal will integrate locality data interpretation through GEOLocate with a Biodiversity Enhanced Locality Service (BELS), which will make it possible to look up pre-existing localities that have been georeferenced using best practices. A third goal is to connect these workflows and services to Symbiota, a community digitization hub, to allow easy inflow and outflow of content back to digitization networks. Providers will be able to easily access new data along with associated metadata about processing steps, all returned using established standards and best practices. The key to this effort will be engagement with the community, including researchers, collections staff, and Zooniverse volunteers. Engagement will focus on virtual training and working with an advisory committee in order to grow capacity and community involvement.<br/><br/>This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria."
normalized_text = normalize_text(text)
print(f'Original text: {text}')
print('-------------------------------------------------------------------------------------------')
print(f'Normalized text: {normalized_text}')
normalized_abstracts = abstracts_df['abstract'].apply(normalize_text)
------------------------------------------------------------------------------------------- Normalized text: national effort digitize natural history collection transformed previously siloed unstandardized resource networked openly available information nexus usable meet grand scientific societal challenge despite enormous stride major bottleneck digitization process still exist especially area automation approach challenging particular capturing analog specimen data digital format converting text description collecting location mappable geocoordinates remained boutique effort bottleneck many digitized specimen missing element hamper ability specimen record effectively project develop workflow dramatically increase speed specimen data captured made available broadly data provider consumer workflow include novel approach computer human intelligence advance ability capture specimen information workflow focus challenge automated conversion imaged specimen label properly formatted usable digital text critical success workflow human validation checkpoint implemented using popular citizen science platform note nature second workflow focus tool take advantage previous effort assign mappable coordinate based specimen collection location automatically mapping information specimen missing data finally effort create tool easy access data common database making data immediately available museum provider researcher alike effort connect public participation science novel tool technology train diverse graduate student undergraduate student bioinformatics museum science. 
effort three design goal together dramatically reduce digitization museum specimen data first design goal combine machine learning method public participation scientific research ppsr successful note nature project speed label digitization facilitate obtaining locality data part first design goal utilizes supervised machine learning approach object character recognition possible also includes human loop using platform gather fast quality feedback human volunteer point approach also provides mean create high-quality training datasets needed improving automation step ultimately reducing human effort second design goal integrate locality data interpretation geolocate biodiversity enhanced locality service bel make possible look pre-existing locality georeferenced using best practice third goal connect workflow service symbiota community digitization allow easy inflow outflow content back digitization network provider able easily access data along associated metadata processing step returned using established standard best practice effort engagement community including researcher collection staff zooniverse volunteer engagement focus virtual training working advisory committee order grow capacity community involvement. award reflects statutory mission deemed worthy support evaluation using foundation intellectual merit broader impact review criterion
The input for LDA is a bag-of-words matrix in which each row is a document and each column holds the count of one word from the corpus vocabulary.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorized_text = vectorizer.fit_transform(normalized_abstracts)
# (num_abstracts, num_words)
print(vectorized_text.shape)
(13159, 43923)
from sklearn.decomposition import LatentDirichletAllocation
num_topics = 10
lda_model = LatentDirichletAllocation(
    n_components=num_topics,
    learning_method='online',
    random_state=92
)
lda_topics = lda_model.fit_transform(vectorized_text)
print(lda_topics.shape) # (num_abstracts, num_topics)
(13159, 10)
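Each row of `lda_topics` is a probability distribution over the topics, so it sums to 1. A minimal self-contained check on toy data (documents invented for illustration):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration
docs = ["cats purr and sleep all day",
        "dogs bark and run outside",
        "stars shine in the night sky"]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)

# fit_transform returns the normalized document-topic matrix:
# each row sums to 1
print(np.allclose(doc_topic.sum(axis=1), 1.0))  # True
```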
After fitting the model, we can extract the words most strongly associated with each topic: the top 100 are kept for the word clouds below, and the top 10 are printed.
# Most important words for each topic
vocabulary = vectorizer.get_feature_names_out()
n_top_words = 100
topic_word_freq = {}
for index, component in enumerate(lda_model.components_):
    vocab_comp = zip(vocabulary, component)
    sorted_words = sorted(vocab_comp, key=lambda x: x[1], reverse=True)[:n_top_words]
    top_words = [x[0] for x in sorted_words]
    topic_word_freq[index] = top_words
    print(f"Topic {index}: {', '.join(top_words[:10])}")
    print("\n")
Topic 0: covid, 19, project, impact, technology, health, broader, disease, using, virus
Topic 1: quantum, theory, project, problem, mathematical, study, research, using, equation, award
Topic 2: model, project, ocean, earth, climate, using, process, impact, temperature, change
Topic 3: data, project, system, model, learning, research, network, using, impact, design
Topic 4: student, research, project, stem, support, program, science, education, learning, university
Topic 5: research, project, data, community, change, impact, social, using, study, support
Topic 6: material, research, project, high, property, using, structure, impact, energy, award
Topic 7: cell, protein, project, plant, research, gene, biology, biological, function, using
Topic 8: water, carbon, chemical, soil, chemistry, project, organic, energy, process, reaction
Topic 9: wave, physic, star, award, using, particle, plasma, galaxy, energy, matter
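One caveat: `lda_model.components_` holds unnormalized pseudo-counts, not probabilities. To read each row as a word distribution P(word | topic), normalize it by its sum. A self-contained sketch on a toy corpus (invented for illustration):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration
docs = ["cats purr and sleep", "dogs bark and run", "cats and dogs play"]
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# components_ rows are pseudo-counts; divide each row by its sum
# to obtain a proper distribution over the vocabulary
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
print(np.allclose(topic_word.sum(axis=1), 1.0))  # True
```

The ranking of words within a topic is unchanged by this normalization, which is why the raw `components_` values are fine for picking the top words above.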
Sometimes word clouds are a better way to identify the topics at a glance.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def generate_wordcloud(words):
    wordcloud = WordCloud(background_color='white',
                          max_words=50, max_font_size=40,
                          random_state=42
                          ).generate(str(words))
    return wordcloud
fig, axes = plt.subplots(5, 2, figsize=(15, 25))
for i in range(5):
    for j in range(2):
        ax = axes[i, j]
        ax.imshow(generate_wordcloud(topic_word_freq[5*j + i]), interpolation="bilinear")
        ax.axis('off')
        ax.set_title(f"Topic {5*j + i}", fontsize=30)
To visualize the documents on a 2D plot, we'll reduce the 10-dimensional topic vectors to two dimensions with t-SNE.
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1)
# reduce dimension to 2 using tsne
tsne_lda = tsne_model.fit_transform(lda_topics)
[t-SNE] Computing 91 nearest neighbors... [t-SNE] Indexed 13159 samples in 0.009s... [t-SNE] Computed neighbors for 13159 samples in 0.615s... [t-SNE] Mean sigma: 0.000000 [t-SNE] KL divergence after 250 iterations with early exaggeration: 78.629623 [t-SNE] KL divergence after 1000 iterations: 1.161142
import numpy as np
# Each row of lda_topics is already a normalized topic distribution
# (it sums to 1), so the dominant topic per document is the row-wise argmax.
# This replaces the deprecated np.matrix-based normalization loop.
lda_keys = lda_topics.argmax(axis=1)
lda_df = pd.DataFrame(tsne_lda, columns=['x', 'y'])
# Use .to_numpy() so that the (possibly non-contiguous) index of
# abstracts_df after dropna doesn't misalign with lda_df's fresh RangeIndex
lda_df['abstract'] = abstracts_df['abstract'].to_numpy()
lda_df['award_id'] = abstracts_df['award_id'].to_numpy()
lda_df['topic'] = lda_keys
lda_df['topic'] = lda_df['topic'].astype(int)
lda_df
| | x | y | abstract | award_id | topic |
|---|---|---|---|---|---|
| 0 | -1.275147 | 16.423231 | National efforts to digitize natural history c... | 2027234.0 | 3 |
| 1 | 33.612213 | 5.749017 | An award is made to the Natural History Museum... | 2018207.0 | 4 |
| 2 | -6.057844 | 24.984032 | Current software for user authentication relie... | 2039373.0 | 3 |
| 3 | 25.468138 | 15.642528 | This collaborative project comprised of ten aw... | 2001394.0 | 5 |
| 4 | 27.775192 | 32.858318 | Cyberlearning technologies that incorporate ro... | 2030441.0 | 4 |
| ... | ... | ... | ... | ... | ... |
| 13154 | 7.274016 | -32.160351 | Recent advances in artificial intelligence (AI... | 2008228.0 | 5 |
| 13155 | 29.610926 | -9.462381 | Data visualization is a key component to disco... | 2006710.0 | 4 |
| 13156 | 64.865776 | -51.218269 | The broader impact/commercial potential of thi... | 2035899.0 | 2 |
| 13157 | -38.417030 | -18.448893 | Gamma-ray astronomy impacts a broad range of k... | 2013109.0 | 6 |
| 13158 | 40.571548 | -8.988461 | Controlling cell differentiation is critical w... | 2033997.0 | 4 |
13159 rows × 5 columns
import bokeh.plotting as bp
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import show, output_notebook
output_notebook()
plot_lda = bp.figure(
    plot_width=700,   # in Bokeh >= 3.0 these arguments are named width/height
    plot_height=600,
    title="LDA topic visualization",
    tools="pan,wheel_zoom,box_zoom,reset,hover",
    x_axis_type=None, y_axis_type=None, min_border=1)
colormap = np.array(["#6d8dca", "#69de53", "#723bca", "#c3e14c", "#c84dc9", "#68af4e", "#6e6cd5",
"#e3be38", "#4e2d7c", "#5fdfa8", "#d34690", "#3f6d31", "#d44427", "#7fcdd8", "#cb4053", "#5e9981",
"#803a62", "#9b9e39", "#c88cca", "#e1c37b", "#34223b", "#bdd8a3", "#6e3326", "#cfbdce", "#d07d3c",
"#52697d", "#194196", "#d27c88", "#36422b", "#b68f79"])
source = ColumnDataSource(data=dict(x=lda_df['x'], y=lda_df['y'],
color=colormap[lda_keys],
abstract=lda_df['abstract'],
topic=lda_df['topic'],
award_id=lda_df['award_id']))
plot_lda.scatter(source=source, x='x', y='y', color='color')
hover = plot_lda.select(dict(type=HoverTool))
hover.tooltips={"abstract":"@abstract",
"topic":"@topic", "award_id":"@award_id"}
show(plot_lda)